In the age of agile software development, continuous integration has become an integral part of every development team. Over time, there may be so many test jobs and nightly builds running that many teams waste a lot of time fixing tests instead of creating new features for their customers. On top of that, in the age of distributed systems, many build jobs might simply break because an external system isn't available or is behaving erroneously. These errors are very annoying because they occur sporadically and are very hard to trace back.
In this blog post, I want to present some options for finding weirdly behaving build jobs. For this, I've downloaded some build log dumps from TravisTorrent, unpacked them and read them into Pandas to gain some insight into breaking build jobs.
As complex as build logs might look, they are semi-structured texts that contain recurring information in each job run.
This means we can read a build log with a data analysis framework that can handle this kind of structured data. Of course, we use Pandas for this to get our effective analysis of build logs started. In this example, I chose one of the biggest Travis CI build log archives for a Java application that I could find (I used Google BigQuery to find the biggest ones). It's SonarQube – a static quality assurance tool for use in continuous integration (what a nice coincidence).
The idea of this notebook is to show you what kind of information you can extract from a build log file to get more insight into how your build is working or failing. I won't show you every detail that you can extract from a build log file, because this tends to get very specific and highly dependent on your own situation. But with the "mechanics" I'll show, you can adapt the analysis to your specific needs as well.
We read in all the files from the build log dump with glob. This allows us to quickly get the relevant files.
Note: Because our log files are all in the same directory, we don't need to use the recursive feature that glob provides to get files from subdirectories as well.
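If the logs were spread across subdirectories, the recursive variant would look like this (a minimal sketch, assuming the same ROOT_DIR as defined in the next cell):

import glob
import os

# "**" together with recursive=True also matches *.log files in subdirectories
log_file_paths = glob.glob(os.path.join(ROOT_DIR, "**", "*.log"), recursive=True)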
In [143]:
import glob
import os
ROOT_DIR = "C:/dev/data/build_logs/SonarSource@sonar-java-light"
ROOT_DIR = "C:/dev/data/build_logs/SonarSource@sonar-java"
GLOB_PATTERN = "*.log"
log_file_paths = glob.glob(os.path.join(ROOT_DIR, GLOB_PATTERN))
log_file_paths[:5]
Out[143]:
We import the raw data into a Pandas DataFrame as soon as possible to avoid custom Python glue code and to be able to use the Pandas framework's "standardized" data wrangling methods.
In [144]:
import pandas as pd
# set width of column for nicer output
pd.set_option('max_colwidth', 130)
raw_logs = pd.DataFrame(log_file_paths, columns=['path'])
raw_logs.head()
Out[144]:
We clean up these ugly, different, OS-specific file separators by using a common one.
Note: We could have also used os.sep, which gives us the OS-specific separator. On Windows, this would be \. But if you plan to extract data later, e.g. with regular expressions, this gets really unreadable, because \ is also the character used to escape certain other characters.
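To illustrate the escaping problem (a tiny sketch; the path is made up): to match one literal backslash, the regular expression engine needs \\, which in a normal Python string literal has to be written with four backslashes.

import re

# a hypothetical Windows-style path (each \\ is one literal backslash)
path = "C:\\dev\\data\\build_logs"

# four backslashes in a normal string literal, or a raw string r"\\",
# just to split on a single literal backslash
re.split("\\\\", path)
re.split(r"\\", path)
# ['C:', 'dev', 'data', 'build_logs']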
In [145]:
# treat the backslash as a literal character, not as a regex
raw_logs['path'] = raw_logs['path'].str.replace("\\", "/", regex=False)
raw_logs.head()
Out[145]:
There is a lot of information in the file path alone: the job name, the build number and the build ID. Let's say we need that information later on, so we extract it with a nice regular expression using named groups.
In [146]:
# TODO: still uses a regex, maybe too slow? Consider "split"?
logs = raw_logs.join(raw_logs['path'].str.extract(
    r"^.*/(?P<jobname>.*)/(?P<build_number>.*?)_(?P<build_id>.*?)_.*\.log$",
    expand=True))
logs.head()
Out[146]:
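To sanity-check the pattern, you can try it on a single path (a small sketch; the concrete build number and build ID in the file name are made up, only the <build_number>_<build_id>_*.log naming convention is taken from the regex above):

import re

example_path = "C:/dev/data/build_logs/SonarSource@sonar-java/42_12345678_1.log"
match = re.match(
    r"^.*/(?P<jobname>.*)/(?P<build_number>.*?)_(?P<build_id>.*?)_.*\.log$",
    example_path)
match.groupdict()
# {'jobname': 'SonarSource@sonar-java', 'build_number': '42', 'build_id': '12345678'}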
In the case of the Travis build log dumps, we get multiple files for each build run. We just need the first one, so we throw away all the other build logs with the same build number.
In [147]:
logs = logs.drop_duplicates(subset=['build_number'], keep='first')
logs.head()
Out[147]:
After dropping the possible duplicate build log files, we can use the build number as the new index (aka key) for our DataFrame.
In [148]:
logs = logs.set_index(['build_number'], drop=True)
logs.head()
Out[148]:
So far, we've just extracted metadata from the file paths of the build log files. Now we're getting to the interesting parts: extracting information from the content of the build log files. For this, we need to load the content of the log files into our DataFrame.
We do this with a little helper function that simply returns the contents of a file whose file_path is given:
In [149]:
def load_file_content(file_path):
    # read the whole log file as one string
    with open(file_path, mode='r', encoding="utf-8") as f:
        return f.read()
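Note: the helper assumes that all logs are valid UTF-8. If some aren't, a defensive variant could replace undecodable bytes instead of failing (a sketch, not what the original analysis used):

def load_file_content_lenient(file_path):
    # errors="replace" substitutes undecodable bytes with U+FFFD
    # instead of raising a UnicodeDecodeError
    with open(file_path, mode='r', encoding="utf-8", errors="replace") as f:
        return f.read()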
We use the function above in the apply call on the path Series.
Note: For many big files, this could take some time to finish.
In [150]:
logs['content'] = logs['path'].apply(load_file_content)
logs.head()
Out[150]:
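If loading all logs takes too long while you are still experimenting, you could restrict the expensive apply call to a random sample of the paths instead of the full run above (a sketch, not part of the original analysis; the sample size is arbitrary):

# work with a reproducible random sample of 100 logs to iterate faster
logs_sample = logs.sample(100, random_state=0).copy()
logs_sample['content'] = logs_sample['path'].apply(load_file_content)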
Because it could get a little bit confusing with so many columns in a single DataFrame, we delete the path column because it's not needed anymore.
In [151]:
log_data = logs.copy()
del(log_data['path'])
log_data.head()
Out[151]:
Let's have a look at some of the contents of a build log file. This is where the analysis gets very specific depending on the continuous integration server used, the build system, the programming language etc. But the main idea is the same: extract some interesting features that show what's going on in your build!
Let's take a look at our scenario: a Travis CI job of a Java application that's built with Maven.
Here are the first lines of one build log.
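A quick way to peek at the beginning of a log is to print a slice of its content (a minimal sketch; the slice bounds are arbitrary):

# show the first 500 characters of the first build log
print(log_data.iloc[0]['content'][:500])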
In [152]:
# TODO put every line in a new row
This could also take a while because we are doing a string operation, which is slow.
In [166]:
# split the content of the first few logs into one row per log line
entries_per_row = log_data[0:4].content.str.split("\n", expand=True)
log_rows = pd.DataFrame(entries_per_row.stack(), columns=["data"])
log_rows
Out[166]:
In [159]:
log_status = log_rows[log_rows["data"].str.contains("Done.", regex=False)]
log_status.head()
Out[159]:
In [ ]:
%matplotlib inline
log_status["data"].value_counts().plot(kind='pie')
Browsing manually through the log, we find some interesting features, e.g. the start time of the build.
In [102]:
print(log_data.iloc[0]['content'][1400:1800])
So let's check how the builds finished. For this, we extract the last line of each build log and put the most frequent variants into a new DataFrame.
In [66]:
logs['finished'] = logs.content.str[-100:].str.extract(r"(.*)\n*$", expand=False)
pd.DataFrame(logs['finished'].value_counts().head(10))
Out[66]:
In [67]:
mapping = {
    "Done. Your build exited with 0." : "SUCCESS",
    "Done. Build script exited with: 0" : "SUCCESS",
    "Done. Build script exited with 0" : "SUCCESS",
    "Your build has been stopped." : "STOPPED",
    "The build has been terminated." : "TERMINATED",
    "The build has been terminated" : "TERMINATED",
    "Done. Your build exited with 1." : "ERROR",
    "Done. Build script exited with: 1" : "ERROR",
    "Your test run exceeded " : "ABORTED"
}
logs['finished_state'] = logs['finished'].map(mapping)
logs.loc[logs['finished_state'].isnull(), "finished_state"] = "UNKNOWN"
logs['finished_state'].value_counts()
Out[67]:
In [68]:
# the first travis_time marker in the log contains the start timestamp
logs['start_time'] = logs['content'].str.extract(
    r"travis_time:end:.*:start=([0-9]*),", expand=False)
logs['start_time'].head()
Out[68]:
In [69]:
# the timestamps are nanoseconds since the epoch: dividing by 1000
# yields microseconds, which pd.to_datetime understands via unit='us'
logs['start_time'] = pd.to_datetime(
    pd.to_numeric(
        logs['start_time']) / 1000, unit='us')
logs['start_time'].head()
Out[69]:
In [70]:
# a travis_time marker near the end of the log contains the finish timestamp
logs['end_time'] = logs['content'].str[-500:].str.extract(
    r"travis_time:end:.*:start=[0-9]*,finish=([0-9]*)", expand=False)
logs['end_time'] = pd.to_datetime(pd.to_numeric(logs['end_time'])/1000, unit='us')
logs['end_time'].head()
Out[70]:
In [71]:
logs['duration'] = logs['end_time'] - logs['start_time']
In [72]:
successful_builds = logs[logs['finished_state'] == "SUCCESS"].dropna()
successful_builds['duration'].mean()
Out[72]:
In [73]:
len(logs)
Out[73]:
In [74]:
t = successful_builds['duration']
t.memory_usage()
Out[74]:
In [75]:
successful_builds['duration_in_min'] = successful_builds['duration'].astype('timedelta64[m]')
successful_builds['duration_in_min'].head()
Out[75]:
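Note: In newer Pandas versions, converting a timedelta with astype('timedelta64[m]') isn't supported anymore. A variant that works across versions (a sketch; note that it yields fractional instead of truncated minutes):

successful_builds['duration_in_min'] = successful_builds['duration'].dt.total_seconds() / 60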
In [76]:
successful_builds_over_time = (successful_builds
    .reset_index()
    .set_index(pd.DatetimeIndex(successful_builds['start_time']))
    .resample('1W')
    .mean(numeric_only=True))  # numeric_only skips the remaining text columns
successful_builds_over_time.head()
Out[76]:
In [77]:
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
successful_builds_over_time['duration_in_min'].plot(
title="Build time in minutes")
Out[77]: